An Alphabet-Friendly FM-Index

نویسندگان

  • Paolo Ferragina
  • Giovanni Manzini
  • Veli Mäkinen
  • Gonzalo Navarro
چکیده

We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T [1, n] is bounded by nHk(T )+O ( (n log log n)/ log|Σ| n ) bits, where Hk(T ) is the k-th order empirical entropy of T . The above bound holds simultaneously for all k ≤ α log|Σ| n and 0 < α < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in analysis of the space occupancy. Using our index, the counting of the occurrences of an arbitrary pattern P [1, p] as a substring of T takes O(p log |Σ|) time. Locating each pattern occurrence takes O(log |Σ| (log n/ log log n)) time. Reporting a text substring of length l takes O((l+ log n/ log log n) log |Σ|) time.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Efficient Composite-Alphabet Transform for String Matching under a Restricted Alphabet Set

String matching is a problem of finding all occurrences of a short pattern on a relatively long reference string. While a number of methods have been presented, most published implementations assume several restrictions due to some practical issues. We focus on the restriction of the alphabet size, which is usually set to be 256 in many string matching libraries. When strings must be handled ov...

متن کامل

List of Contributions The Pre - history and Future of the Block - Sorting Compression Algorithm 4

The FM-index is a succinct text index needing only O(Hkn) bits of space, where n is the text size and Hk is the kth order entropy of the text. FM-index assumes constant alphabet; it uses exponential space in the alphabet size, σ. In this paper we show how the same ideas can be used to obtain an index needing O(Hkn) bits of space, with the constant factor depending only logarithmically on σ. Our...

متن کامل

First Huffman, Then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index

Main Results. The basic string matching problem is to determine the occurrences of a short pattern P = p1p2 . . . pm in a large text T = t1t2 . . . tn, over an alphabet of size σ. Indexes are structures built on the text to speed up searches, but they used to take up much space. In recent years, succinct text indexes have appeared. A prominent example is the FM-index [2], which takes little spa...

متن کامل

A simple alphabet-independent FM-index

We design a succinct full-text index based on the idea of Huffmancompressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H0, our index needs O(n(H0 + 1)) bits of space, wi...

متن کامل

FM-index for Dummies

The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, we can observe a surge of interest in practical FM-index variants in the last few years. These enhancements are often related to a bit-vector representation, augmented with an efficient rankhandling data structure. In this work, we propose a n...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004